Breast cancer is a cancer that develops from breast tissue. It is the second leading cause of cancer death in women, after lung cancer, and the lifetime risk of dying from breast cancer is about 1 in 39 (American Cancer Society, 2021). Correctly predicting whether a tumor is benign or malignant can have a significant impact on subsequent decisions, such as further screening and preventative action (Stark, Hart, Nartowt and Deng, 2019). In this report, we use 11 different variables to predict the binary outcome (benign or malignant) of a breast cancer diagnosis. We hope to build a model that predicts the nature of a breast tumor well from predictors such as its texture, smoothness, and symmetry.
The Breast Cancer dataset from Kaggle consists of 569 observations and 32 variables: an ID variable, a diagnosis variable giving the tumor status (benign or malignant), and 30 measurement variables. We removed the ID column and the empty "X" column, both of which are meaningless for prediction. The 30 predictors are derived from 10 base features:
* radius
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features (Breast Cancer Wisconsin (Diagnostic) Data Set, 2021).
However, we kept only 11 of the 30 variables, because some of the others (such as the size-related measurements) are so informative that they would largely determine the prediction on their own. As a result, we kept the standard errors of concave points, fractal dimension, symmetry, concavity, and texture; the means of smoothness, symmetry, and texture; and the worst values of symmetry, texture, and smoothness as our predictors. These variables are related to the tumor but are not individually decisive for whether it is malignant, so we believe these 11 variables make good predictors for our purpose. In addition, we want to investigate how accurately we can predict when we only have partial, less directly related information.
For data tidying, we converted the diagnosis variable from character to factor for further analysis. We then checked for missing values and, fortunately, there are none in the data. Furthermore, we recoded the outcome as 0 for benign and 1 for malignant for convenience, and split the data into training (80%) and testing (20%) sets.
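A minimal sketch of these tidying steps in R. The file name `data.csv`, the raw `B`/`M` codes, and the use of `caret::createDataPartition` for the 80/20 split are assumptions based on the Kaggle dataset description, not confirmed details of our script:

```r
# Sketch of the data tidying steps (file name and codes assumed).
library(caret)

dat <- read.csv("data.csv")
dat$id <- NULL   # drop the meaningless ID column
dat$X  <- NULL   # drop the empty trailing "X" column

# Convert diagnosis from character to factor
dat$diagnosis <- factor(dat$diagnosis,
                        levels = c("B", "M"),
                        labels = c("Benign", "Malignant"))

sum(is.na(dat))  # check for missing values (0 in this dataset)

# Stratified 80/20 train/test split
set.seed(1)
train_idx <- createDataPartition(dat$diagnosis, p = 0.8, list = FALSE)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```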
We use exploratory analysis to show the distribution, characteristics, and interesting structure of the dataset; several plots are used in this part.
The plot above shows that among the 569 observations, 357 (about 62.7%) are diagnosed as benign and 212 (about 37.3%) as malignant.
The correlation plot above shows that several of the 11 predictors are correlated with each other, suggesting that multicollinearity is likely. This may cause problems such as overfitting and difficulty in interpretation.
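A sketch of how such a correlation plot can be produced; the `corrplot` package and a data frame `train` whose numeric columns are the 11 predictors are assumptions:

```r
# Sketch: correlation plot of the 11 numeric predictors.
library(corrplot)

num_pred <- train[, sapply(train, is.numeric)]
corrplot(cor(num_pred), method = "circle", type = "upper",
         tl.cex = 0.7)  # shrink labels so all names fit
```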
The plots above show the distributions of all 11 predictors.
As described above, the models are fit on the training set and their performance is checked on the held-out test set.
Since the outcome variable is binary and the predictors are highly correlated with each other, we fit a logistic regression model and a penalized logistic regression model to the data. The underlying assumptions of logistic regression are that observations are independent of each other, that there is little or no multicollinearity, and that the log odds are linear in the predictors.
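A sketch of the logistic regression fit behind the confusion matrix below, assuming the `train`/`test` data frames and the 0.5 classification threshold (the threshold is our assumption):

```r
# Sketch: plain logistic regression and test-set confusion matrix.
glm.fit <- glm(diagnosis ~ ., data = train, family = binomial)

prob <- predict(glm.fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "Malignant", "Benign"),
               levels = c("Benign", "Malignant"))

caret::confusionMatrix(pred, test$diagnosis, positive = "Malignant")
```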
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 65 8
## Malignant 6 34
##
## Accuracy : 0.8761
## 95% CI : (0.8009, 0.9306)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 3.585e-09
##
## Kappa : 0.7321
##
## Mcnemar's Test P-Value : 0.7893
##
## Sensitivity : 0.8095
## Specificity : 0.9155
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8904
## Prevalence : 0.3717
## Detection Rate : 0.3009
## Detection Prevalence : 0.3540
## Balanced Accuracy : 0.8625
##
## 'Positive' Class : Malignant
##
For the testing data, the confusion matrix gives an accuracy of 87.61%. The p-value for the accuracy exceeding the no-information rate is nearly zero (3.585e-09), so the model is useful. Specificity (91.55%) is greater than sensitivity (80.95%). However, since the data contain eleven predictors, which may be more than necessary, we use a regularization approach, which shrinks unimportant coefficients through the penalty parameter lambda and thus penalizes a large number of predictors.
## Area under the curve: 0.9202
## alpha lambda
## 32 0.2 0.00200818
For this penalized logistic regression model, the selected optimal tuning parameters are alpha = 0.2 and lambda = 0.002. We picked these values by setting a tuning grid for lambda from exp(-12) to exp(-2) and using cross-validation to choose the combination that maximizes the area under the ROC curve.
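A sketch of the elastic-net tuning with caret and glmnet. The lambda range matches the text (exp(-12) to exp(-2)); the exact alpha sequence and fold count are assumptions:

```r
# Sketch: cross-validated elastic-net tuning (caret + glmnet).
library(caret)

ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

glmnGrid <- expand.grid(alpha  = seq(0, 1, length = 6),
                        lambda = exp(seq(-12, -2, length = 50)))

model.glmn <- train(diagnosis ~ ., data = train,
                    method = "glmnet", tuneGrid = glmnGrid,
                    metric = "ROC", trControl = ctrl)

model.glmn$bestTune   # alpha = 0.2, lambda ~ 0.002 in our run
```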
For the testing data, the area under the ROC curve is 0.9202. The coefficients below provide some information on how the predictors affect the outcome variable (malignant or not).
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -18.9342258
## symmetry_mean -4.4474184
## texture_mean 0.1011702
## texture_se -1.7326574
## texture_worst 0.2222315
## smoothness_mean 70.3861777
## smoothness_worst -2.4300359
## symmetry_se -129.4627250
## fractal_dimension_se -322.2466015
## symmetry_worst 21.1204194
## concavity_se -8.8000830
## concave.points_se 383.7196728
For a one-unit increase in the standard error of the number of concave portions of the cell contour, the odds of a malignant tumor are multiplied by e^383.72 ≈ 4.44e+166. For a one-unit decrease in the standard error of fractal dimension, the odds are multiplied by e^322.25 ≈ 8.94e+139. (These extreme multipliers reflect the very small scale of these predictors; a full one-unit change is far beyond their observed range.) Relating the coefficients to relative variable importance, the standard error of concave points is the most influential variable, whereas the mean of texture is the least influential, since its coefficient is closest to zero.
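The odds-ratio multipliers quoted above are just the exponentiated coefficients, which can be checked directly:

```r
# A one-unit change in a predictor multiplies the odds by exp(beta).
exp(383.7196728)   # concave.points_se coefficient: ~4.44e+166
exp(322.2466015)   # |fractal_dimension_se| coefficient: ~8.94e+139
```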
Since many predictors are highly correlated with each other, we also fit a LASSO regression. The LASSO model assumes a linear relationship between the predictors and the log odds.
## alpha lambda
## 11 1 0.007447329
## Area under the curve: 0.9209
For this LASSO regression model, the selected optimal tuning parameter is lambda = 0.00745, with alpha fixed at 1. We picked this value by setting a tuning grid from exp(-5) to exp(5). In this model, the coefficients of the mean of symmetry and the worst of smoothness have been shrunk to exactly 0. Besides, the AUC is 0.9209.
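A sketch of the LASSO fit: alpha is fixed at 1 and lambda is searched over exp(-5) to exp(5) as described in the text; the fold count is an assumption:

```r
# Sketch: LASSO via caret + glmnet (alpha = 1 forces the L1 penalty).
library(caret)

ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

lassoGrid <- expand.grid(alpha  = 1,
                         lambda = exp(seq(-5, 5, length = 100)))

model.lasso <- train(diagnosis ~ ., data = train,
                     method = "glmnet", tuneGrid = lassoGrid,
                     metric = "ROC", trControl = ctrl)

# Coefficients at the chosen lambda; some are shrunk exactly to 0
coef(model.lasso$finalModel, model.lasso$bestTune$lambda)
```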
For a one-unit increase in the mean of smoothness, the odds are multiplied by exp(57.98). For a one-unit increase in the standard error of symmetry, the odds are multiplied by exp(-93.36), a decrease. For a one-unit increase in the standard error of fractal dimension, the odds are multiplied by exp(-287.63). For a one-unit increase in the standard error of concave points, the odds are multiplied by exp(309.52). Thus, we conclude that the standard error of concave points and the standard error of fractal dimension are the two most important variables.
## k
## 4 16
Since the true relationship could be nonlinear, we also fit a MARS model in addition to the random forest. Because MARS is a non-parametric method, no assumption is made about the relationship between the outcome and the predictors.
## nprune degree
## 12 13 1
## Area under the curve: 0.944
We use a tuning grid with degree from 1 to 3 and nprune from 2 to 20 and choose the model with the best cross-validated performance; the model with nprune = 13 and degree = 1 has been chosen. The worst of texture and the standard error of concave points are the two most important variables. The AUC for this model is 0.944.
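A sketch of the MARS tuning via caret's `earth` method; the grid matches the text (degree 1 to 3, nprune 2 to 20), while the fold count is an assumption:

```r
# Sketch: cross-validated MARS tuning (caret + earth).
library(caret)

ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

marsGrid <- expand.grid(degree = 1:3, nprune = 2:20)

tuned_mars <- train(diagnosis ~ ., data = train,
                    method = "earth", tuneGrid = marsGrid,
                    metric = "ROC", trControl = ctrl)

tuned_mars$bestTune   # nprune = 13, degree = 1 in our run
varImp(tuned_mars)    # ranks texture_worst, concave.points_se highest
```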
The support vector machine is an efficient learning algorithm for non-linear decision boundaries. Because our final model could be non-linear, we train an SVM with a radial kernel. We use the radial kernel rather than a linear kernel because our dataset is not linearly separable.
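A sketch of the radial-kernel SVM fit via caret (kernlab backend, as in the `ksvm` output below); the tuning ranges for C and sigma are assumptions:

```r
# Sketch: radial-kernel SVM tuned over cost C and kernel width sigma.
library(caret)

ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

svmGrid <- expand.grid(C     = exp(seq(-1, 4, length = 10)),
                       sigma = exp(seq(-6, -2, length = 10)))

svmr.fit <- train(diagnosis ~ ., data = train,
                  method = "svmRadialSigma", tuneGrid = svmGrid,
                  metric = "ROC", trControl = ctrl)

svmr.fit$finalModel   # prints the ksvm summary shown below
```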
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 17.9733281381951
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0117436284570214
##
## Number of Support Vectors : 138
##
## Objective Function Value : -1912.636
## Training error : 0.096491
## Probability model included.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 64 5
## Malignant 7 37
##
## Accuracy : 0.8938
## 95% CI : (0.8218, 0.9439)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 1.762e-10
##
## Kappa : 0.7748
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.9014
## Specificity : 0.8810
## Pos Pred Value : 0.9275
## Neg Pred Value : 0.8409
## Prevalence : 0.6283
## Detection Rate : 0.5664
## Detection Prevalence : 0.6106
## Balanced Accuracy : 0.8912
##
## 'Positive' Class : Benign
##
For the support vector machine with a radial kernel, the training and testing errors are 0.096 and 0.106, respectively.
For new observations, we can inspect the importance of the features by looking at their color in each case. As we can see, a large value of the standard error of concave points is highly associated with a malignant diagnosis, which corresponds to our conclusion from the penalized logistic regression (elastic net). Moreover, the worst of symmetry is the second most influential variable associated with a malignant diagnosis.
The LDA model assumes that the predictors within each class are drawn from a normal distribution, and further that the two classes share a common covariance matrix.
Since ‘fractal_dimension_se’ and ‘concave.points_se’ are the two most important predictors, we select them to check the equal-covariance assumption. As we can see, the dispersion within the Malignant group is similar to that within the Benign group. Moreover, since LDA is fairly robust to violations of the normality assumption, which does not hold for many predictors, we build the linear discriminant analysis model using the lda function.
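A sketch of the LDA fit and its test-set ROC; `MASS::lda` is the function referenced in the text, and the `pROC` package for the ROC curve is an assumption:

```r
# Sketch: linear discriminant analysis and test-set AUC.
library(MASS)
library(pROC)

lda.fit  <- lda(diagnosis ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)

# AUC from the posterior probability of the Malignant class
roc(test$diagnosis, lda.pred$posterior[, "Malignant"])
```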
## Area under the curve: 0.9252
The area under the ROC curve on the testing data is 0.9252.
##
## Call:
## summary.resamples(object = resamples(list(Elastic_net = model.glmn, KNN
## = model.knn, MARS = tuned_mars, RF = rf.fit, SVM = svmr.fit, LDA =
## model.lda, LASSO = model.lasso)))
##
## Models: Elastic_net, KNN, MARS, RF, SVM, LDA, LASSO
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Elastic_net 0.8907563 0.9371197 0.9504491 0.9467183 0.9594139 0.9858012 0
## KNN 0.8661258 0.9203854 0.9390756 0.9324290 0.9542162 0.9574037 0
## MARS 0.8427992 0.9396552 0.9576572 0.9508258 0.9774160 1.0000000 0
## RF 0.8722110 0.9191901 0.9356165 0.9378332 0.9680527 0.9873950 0
## SVM 0.9290061 0.9490546 0.9648290 0.9597146 0.9710953 0.9810924 0
## LDA 0.9107505 0.9300022 0.9433135 0.9429948 0.9524051 0.9736308 0
## LASSO 0.9107505 0.9275210 0.9543611 0.9440814 0.9578383 0.9684874 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Elastic_net 0.8571429 0.8937808 0.8965517 0.9158867 0.9310345 1.0000000 0
## KNN 0.8214286 0.8349754 0.8965517 0.8880542 0.9285714 0.9655172 0
## MARS 0.7931034 0.8928571 0.9125616 0.9124384 0.9310345 1.0000000 0
## RF 0.7931034 0.9285714 0.9476601 0.9373153 0.9913793 1.0000000 0
## SVM 0.8928571 0.8965517 0.9125616 0.9195813 0.9285714 1.0000000 0
## LDA 0.8214286 0.8937808 0.9125616 0.9158867 0.9559729 1.0000000 0
## LASSO 0.8620690 0.8697660 0.9119458 0.9128079 0.9559729 0.9655172 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Elastic_net 0.6470588 0.7205882 0.8235294 0.8117647 0.8676471 1.0000000 0
## KNN 0.7058824 0.7647059 0.7941176 0.8058824 0.8676471 0.8823529 0
## MARS 0.6470588 0.7647059 0.8235294 0.8352941 0.9411765 1.0000000 0
## RF 0.7058824 0.7647059 0.7941176 0.8058824 0.8676471 0.9411765 0
## SVM 0.7647059 0.7647059 0.7941176 0.8294118 0.8676471 1.0000000 0
## LDA 0.6470588 0.7647059 0.8235294 0.8117647 0.8823529 0.9411765 0
## LASSO 0.7058824 0.7647059 0.8235294 0.8117647 0.8235294 0.9411765 0